Local Winter Weather Analysis
As a skier who grew up in the Western NY/Southern Tier area and stayed here, I have personally noticed the winter season shortening and worsening (for those of us who like winter) over my 33 years. This is especially apparent now that I am coaching and need to pay close attention to conditions from day to day, or even hour to hour, to ensure a training environment that is not only useful but safe. The season is shrinking on both ends: it no longer starts as early in December, or even in January, and it ends in early March instead of later in the month. With this in mind, I am interested in seeing whether what I have been experiencing is purely anecdotal, or a trick of memory, or whether there is any truth to it.
The Data In Question
NOAA, the National Oceanic and Atmospheric Administration, has tools on their website for downloading public weather station data from around the USA. Using the "Search Tool" I found data from 3 weather stations that are local and significant to me in Western NY and the Southern Tier: Bath, NY; Dansville, NY; and Alfred, NY. Each station reports a "Data Coverage" percentage for each data field it makes available. There is, unfortunately, nothing we can do to remedy missing data, but we can keep it in mind and put any gaps into context. Their footnote regarding coverage states "Coverage is an approximation of total completeness based on the most complete data element, and the overall data range."
NOAA also provides APIs to pull data, but for now I am simply going to download CSVs of the data and use them here.
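For reference, NCEI's Access Data Service can serve the same daily-summaries data over HTTP. The endpoint and parameter names below are my reading of NCEI's public API documentation, so treat them as assumptions to verify; this sketch only builds the request URL:

```python
from urllib.parse import urlencode

def ncei_daily_summaries_url(station_id, start, end):
    """Build a request URL for GHCND daily summaries from one station.
    Endpoint and parameter names assume NCEI's Access Data Service v1."""
    base = 'https://www.ncei.noaa.gov/access/services/data/v1'
    params = {
        'dataset': 'daily-summaries',
        'stations': station_id,      # bare GHCND id, without the 'GHCND:' prefix
        'startDate': start,
        'endDate': end,
        'format': 'csv',
        'units': 'standard',         # inches and degrees Fahrenheit, like the site downloads
    }
    return base + '?' + urlencode(params)

url = ncei_daily_summaries_url('USC00300085', '1893-01-01', '2024-08-25')
# pd.read_csv(url) could then load it directly, network permitting.
```

For this analysis, though, the downloaded CSVs are simpler to work with.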
Each of NOAA's stations' "Daily Summaries Station Details" pages links to a data documentation document, which in turn lists a supplementary document.
Weather Stations
Alfred, NY
Alfred has a single station with data available from 1893/01/01 to 2024/08/25
- ALFRED, NY US - GHCND:USC00300085 - 1893/01/01 to 2024/08/25 - 91% Coverage
Pertinent data available from Alfred, NY
Air Temperature
- TMAX - Maximum Temperature - 90% Coverage
- TMIN - Minimum Temperature - 91% Coverage
- TOBS - Temperature at time of observation - 87% Coverage
Precipitation
- PRCP - Precipitation - 87% Coverage
- SNOW - Snowfall - 80% Coverage
- SNWD - Snow Depth - 75% Coverage
Bath, NY
The Bath, NY record comprises data from 6 stations in different locations since 1953.
Pertinent data available from Bath, NY
Air Temperature
- TMAX - Maximum Temperature - 88% Coverage
- TMIN - Minimum Temperature - 88% Coverage
- TOBS - Temperature at time of observation - 89% Coverage
Precipitation
- PRCP - Precipitation - 91% Coverage
- SNOW - Snowfall - 90% Coverage
- SNWD - Snow Depth - 100% Coverage
Dansville, NY
Dansville has a single station with data available from 1917/07/01 to 2024/08/25
- DANSVILLE, NY US - GHCND:USC00301974 - 1917/07/01 to 2024/08/25 - 95% Coverage
Pertinent data available from Dansville, NY
Air Temperature
- TMAX - Maximum Temperature - 95% Coverage
- TMIN - Minimum Temperature - 95% Coverage
- TOBS - Temperature at time of observation - 94% Coverage
Precipitation
- PRCP - Precipitation - 91% Coverage
- SNOW - Snowfall - 84% Coverage
- SNWD - Snow Depth - 69% Coverage
What is our research?
From that investigation of each weather station's available data on the NOAA website, we can get some insight into what we might be able to investigate, all in the context of identifying a change in winter weather patterns over time.
I think we can investigate the following topics:
I wanted to find the number of days with rain, because rain strongly affects both the snowpack and the daily experience during winter, but the precipitation data is more involved than I can confidently interpret at the moment; I will need more time with the documentation before I can look into it.
Let's get our data
But first, let's check that we have all the files we need in our directory and confirm their file names; then we can load the data.
To ensure that we can verify what data we are working with at any step, let's also add some identifying data to the dataframes. We will do this by adding columns, one with the name of the file itself, and one with a broad location identifier.
#List the files in our data directory so the following steps are easier to follow
import os
print('The files in our directory are:')
for file_path in os.scandir('Data/'):
    if file_path.is_file():
        print(file_path.name)
The files in our directory are:
NOAA Daily Summaries - Alfred, NY.csv
NOAA Daily Summaries - Bath, NY.csv
NOAA Daily Summaries - Dansville, NY.csv
Okay, we have the files we downloaded
Now we can get to importing the data and using it.
#import the data
import pandas as pd
#Specify our file names
csv_alfred = 'Data/NOAA Daily Summaries - Alfred, NY.csv'
csv_bath = 'Data/NOAA Daily Summaries - Bath, NY.csv'
csv_dansville = 'Data/NOAA Daily Summaries - Dansville, NY.csv'
#import Alfred data, and add identifying columns
df_alfred = pd.read_csv(csv_alfred, low_memory=False) #Read file into dataframe
df_alfred.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a column with the file name
df_alfred.insert(1,'LOCATION','Alfred, NY') #Insert a column with a broad location name
#import Bath data
df_bath = pd.read_csv(csv_bath, low_memory=False) #Read file into dataframe
df_bath.insert(0,'FILE_NAME',os.path.basename(csv_bath)) #Insert a column with the file name
df_bath.insert(1,'LOCATION','Bath, NY') #Insert a column with a broad location name
#import Dansville data
df_dansville = pd.read_csv(csv_dansville, low_memory=False) #Read file into dataframe
df_dansville.insert(0,'FILE_NAME',os.path.basename(csv_dansville)) #Insert a column with the file name
df_dansville.insert(1,'LOCATION','Dansville, NY') #Insert a column with a broad location name
#Verify the data fields in each file
print('File Name:',csv_alfred)
print('Location:',df_alfred['LOCATION'].iloc[0])
df_alfred.info()
File Name: Data/NOAA Daily Summaries - Alfred, NY.csv Location: Alfred, NY <class 'pandas.core.frame.DataFrame'> RangeIndex: 43858 entries, 0 to 43857 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FILE_NAME 43858 non-null object 1 LOCATION 43858 non-null object 2 STATION 43858 non-null object 3 NAME 43858 non-null object 4 LATITUDE 43858 non-null float64 5 LONGITUDE 43858 non-null float64 6 ELEVATION 43858 non-null float64 7 DATE 43858 non-null object 8 DAPR 6 non-null float64 9 DAPR_ATTRIBUTES 6 non-null object 10 DASF 1 non-null float64 11 DASF_ATTRIBUTES 1 non-null object 12 MDPR 6 non-null float64 13 MDPR_ATTRIBUTES 6 non-null object 14 MDSF 1 non-null float64 15 MDSF_ATTRIBUTES 1 non-null object 16 PRCP 42059 non-null float64 17 PRCP_ATTRIBUTES 42059 non-null object 18 SNOW 38702 non-null float64 19 SNOW_ATTRIBUTES 38702 non-null object 20 SNWD 33925 non-null float64 21 SNWD_ATTRIBUTES 33925 non-null object 22 TMAX 43496 non-null float64 23 TMAX_ATTRIBUTES 43496 non-null object 24 TMIN 43532 non-null float64 25 TMIN_ATTRIBUTES 43532 non-null object 26 TOBS 39470 non-null float64 27 TOBS_ATTRIBUTES 39470 non-null object 28 WESD 68 non-null float64 29 WESD_ATTRIBUTES 68 non-null object 30 WT01 253 non-null float64 31 WT01_ATTRIBUTES 253 non-null object 32 WT03 694 non-null float64 33 WT03_ATTRIBUTES 694 non-null object 34 WT04 254 non-null float64 35 WT04_ATTRIBUTES 254 non-null object 36 WT05 29 non-null float64 37 WT05_ATTRIBUTES 29 non-null object 38 WT06 37 non-null float64 39 WT06_ATTRIBUTES 37 non-null object 40 WT07 1 non-null float64 41 WT07_ATTRIBUTES 1 non-null object 42 WT09 2 non-null float64 43 WT09_ATTRIBUTES 2 non-null object 44 WT11 88 non-null float64 45 WT11_ATTRIBUTES 88 non-null object 46 WT14 3 non-null float64 47 WT14_ATTRIBUTES 3 non-null object 48 WT16 1 non-null float64 49 WT16_ATTRIBUTES 1 non-null object 50 WT18 4 non-null float64 51 WT18_ATTRIBUTES 4 non-null object dtypes: 
float64(25), object(27) memory usage: 17.4+ MB
#Verify the data fields in each file
print('File Name:',csv_bath)
print('Location:',df_bath['LOCATION'].iloc[0])
df_bath.info()
File Name: Data/NOAA Daily Summaries - Bath, NY.csv Location: Bath, NY <class 'pandas.core.frame.DataFrame'> RangeIndex: 24741 entries, 0 to 24740 Data columns (total 44 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FILE_NAME 24741 non-null object 1 LOCATION 24741 non-null object 2 STATION 24741 non-null object 3 NAME 24741 non-null object 4 LATITUDE 24741 non-null float64 5 LONGITUDE 24741 non-null float64 6 ELEVATION 24741 non-null float64 7 DATE 24741 non-null object 8 DAPR 247 non-null float64 9 DAPR_ATTRIBUTES 247 non-null object 10 DASF 3 non-null float64 11 DASF_ATTRIBUTES 3 non-null object 12 MDPR 248 non-null float64 13 MDPR_ATTRIBUTES 248 non-null object 14 MDSF 4 non-null float64 15 MDSF_ATTRIBUTES 4 non-null object 16 PRCP 24412 non-null float64 17 PRCP_ATTRIBUTES 24412 non-null object 18 SNOW 23025 non-null float64 19 SNOW_ATTRIBUTES 23025 non-null object 20 SNWD 20704 non-null float64 21 SNWD_ATTRIBUTES 20704 non-null object 22 TMAX 17315 non-null float64 23 TMAX_ATTRIBUTES 17315 non-null object 24 TMIN 17363 non-null float64 25 TMIN_ATTRIBUTES 17363 non-null object 26 TOBS 17383 non-null float64 27 TOBS_ATTRIBUTES 17383 non-null object 28 WESD 16 non-null float64 29 WESD_ATTRIBUTES 16 non-null object 30 WESF 25 non-null float64 31 WESF_ATTRIBUTES 25 non-null object 32 WT01 287 non-null float64 33 WT01_ATTRIBUTES 287 non-null object 34 WT03 146 non-null float64 35 WT03_ATTRIBUTES 146 non-null object 36 WT04 24 non-null float64 37 WT04_ATTRIBUTES 24 non-null object 38 WT05 8 non-null float64 39 WT05_ATTRIBUTES 8 non-null object 40 WT06 9 non-null float64 41 WT06_ATTRIBUTES 9 non-null object 42 WT11 15 non-null float64 43 WT11_ATTRIBUTES 15 non-null object dtypes: float64(21), object(23) memory usage: 8.3+ MB
#Verify the data fields in each file
print('File Name:',csv_dansville)
print('Location:',df_dansville['LOCATION'].iloc[0])
df_dansville.info()
File Name: Data/NOAA Daily Summaries - Dansville, NY.csv Location: Dansville, NY <class 'pandas.core.frame.DataFrame'> RangeIndex: 37760 entries, 0 to 37759 Data columns (total 46 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FILE_NAME 37760 non-null object 1 LOCATION 37760 non-null object 2 STATION 37760 non-null object 3 NAME 37760 non-null object 4 LATITUDE 37760 non-null float64 5 LONGITUDE 37760 non-null float64 6 ELEVATION 37760 non-null float64 7 DATE 37760 non-null object 8 DAPR 8 non-null float64 9 DAPR_ATTRIBUTES 8 non-null object 10 DASF 1 non-null float64 11 DASF_ATTRIBUTES 1 non-null object 12 MDPR 9 non-null float64 13 MDPR_ATTRIBUTES 9 non-null object 14 MDSF 1 non-null float64 15 MDSF_ATTRIBUTES 1 non-null object 16 PRCP 37319 non-null float64 17 PRCP_ATTRIBUTES 37319 non-null object 18 SNOW 32589 non-null float64 19 SNOW_ATTRIBUTES 32589 non-null object 20 SNWD 26951 non-null float64 21 SNWD_ATTRIBUTES 26951 non-null object 22 TMAX 36842 non-null float64 23 TMAX_ATTRIBUTES 36842 non-null object 24 TMIN 36884 non-null float64 25 TMIN_ATTRIBUTES 36884 non-null object 26 TOBS 36777 non-null float64 27 TOBS_ATTRIBUTES 36777 non-null object 28 WT01 139 non-null float64 29 WT01_ATTRIBUTES 139 non-null object 30 WT03 300 non-null float64 31 WT03_ATTRIBUTES 300 non-null object 32 WT04 37 non-null float64 33 WT04_ATTRIBUTES 37 non-null object 34 WT05 5 non-null float64 35 WT05_ATTRIBUTES 5 non-null object 36 WT06 5 non-null float64 37 WT06_ATTRIBUTES 5 non-null object 38 WT08 2 non-null float64 39 WT08_ATTRIBUTES 2 non-null object 40 WT09 16 non-null float64 41 WT09_ATTRIBUTES 16 non-null object 42 WT11 21 non-null float64 43 WT11_ATTRIBUTES 21 non-null object 44 WT14 98 non-null float64 45 WT14_ATTRIBUTES 98 non-null object dtypes: float64(22), object(24) memory usage: 13.3+ MB
Let's see the data we are working with
We know from our investigation of the website that the 3 files share column names for similar data, so we can start by examining the file with the highest number of columns, which is the file for Alfred, NY.
With only roughly 45,000 rows in the largest file (Alfred), we can scroll through the cells here to look at the data and verify what we might need to use.
We can quickly see that only a small handful of the columns have many non-null rows (rows with data), so we can most likely ignore the rest, either by selecting only the useful columns later, removing the others from the dataframe, or re-creating the dataframe with only the useful columns included. We can also see that many of the columns with useful data have accompanying attribute columns; those hold codified, multi-value metadata that helps interpret the measurements.
With so few rows we can also inspect the data in Excel, Google Sheets, LibreOffice Calc, or Notepad(++), though the plain-text editors are harder to read because they do not display cells.
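One way to spot the mostly-empty columns programmatically is to keep only columns whose non-null fraction clears some threshold. This is a standalone sketch on a toy frame (not the notebook's actual data), with a 50% cutoff chosen arbitrarily:

```python
import pandas as pd

# Toy stand-in for df_alfred: keep only columns at least half populated.
df = pd.DataFrame({
    'DATE': ['1893-01-01', '1893-01-02', '1893-01-03', '1893-01-04'],
    'TMAX': [39.0, 34.0, 20.0, 10.0],
    'WT07': [None, None, 1.0, None],  # a sparse "weather type" flag column
})
keep = df.columns[df.notna().mean() >= 0.5]  # fraction of non-null rows per column
print(list(keep))  # → ['DATE', 'TMAX']
```

Here, though, hand-picking the columns from the `info()` listing is just as easy.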
#Display the dataframe as a table
from IPython.display import display
pd.set_option('display.max_columns', None) #Display all of the columns
print('Location:',df_alfred['LOCATION'].iloc[0]) #Show the file location above the table to verify our file
display(df_alfred) #Display the data
Location: Alfred, NY
| FILE_NAME | LOCATION | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | DAPR | DAPR_ATTRIBUTES | DASF | DASF_ATTRIBUTES | MDPR | MDPR_ATTRIBUTES | MDSF | MDSF_ATTRIBUTES | PRCP | PRCP_ATTRIBUTES | SNOW | SNOW_ATTRIBUTES | SNWD | SNWD_ATTRIBUTES | TMAX | TMAX_ATTRIBUTES | TMIN | TMIN_ATTRIBUTES | TOBS | TOBS_ATTRIBUTES | WESD | WESD_ATTRIBUTES | WT01 | WT01_ATTRIBUTES | WT03 | WT03_ATTRIBUTES | WT04 | WT04_ATTRIBUTES | WT05 | WT05_ATTRIBUTES | WT06 | WT06_ATTRIBUTES | WT07 | WT07_ATTRIBUTES | WT09 | WT09_ATTRIBUTES | WT11 | WT11_ATTRIBUTES | WT14 | WT14_ATTRIBUTES | WT16 | WT16_ATTRIBUTES | WT18 | WT18_ATTRIBUTES | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 1893-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | P,,6, | 0.0 | ,,6, | NaN | NaN | 39.0 | ,,6 | 24.0 | ,,6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 1893-01-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.65 | ,,6, | 0.0 | ,,6, | NaN | NaN | 34.0 | ,,6 | 20.0 | ,,6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 1893-01-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.30 | ,,6, | 3.0 | ,,6, | NaN | NaN | 20.0 | ,,6 | 4.0 | ,,6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 1893-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.20 | ,,6, | 2.0 | ,,6, | NaN | NaN | 10.0 | ,,6 | 3.0 | ,,6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 1893-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.30 | ,,6, | 3.0 | ,,6, | NaN | NaN | 15.0 | ,,6 | 0.0 | ,,6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43853 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 2024-08-20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.08 | ,,H,0700 | 0.0 | ,,H,0700 | 0.0 | ,,H,0700 | 64.0 | ,,H | 48.0 | ,,H | 51.0 | ,,H,0700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43854 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 2024-08-21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | ,,H,0700 | 0.0 | ,,H,0700 | 0.0 | ,,H,0700 | 61.0 | ,,H | 47.0 | ,,H | 48.0 | ,,H,0700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43855 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 2024-08-22 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.02 | ,,H,0700 | 0.0 | ,,H,0700 | 0.0 | ,,H,0700 | 64.0 | ,,H | 47.0 | ,,H | 51.0 | ,,H,0700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43856 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 2024-08-23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | ,,H,0700 | 0.0 | ,,H,0700 | 0.0 | ,,H,0700 | 73.0 | ,,H | 51.0 | ,,H | 52.0 | ,,H,0700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 43857 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 42.27985 | -77.76643 | 586.4 | 2024-08-24 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00 | ,,H,0700 | 0.0 | ,,H,0700 | NaN | NaN | 78.0 | ,,H | 52.0 | ,,H | 52.0 | ,,H,0700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
43858 rows × 52 columns
Let's reload the dataframe with only the useful columns
#import Alfred data with only the useful columns, and add identifying columns
df_alfred = pd.read_csv(csv_alfred, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
df_alfred['DATE'] = pd.to_datetime(df_alfred['DATE']) #Convert the DATE column from an object to a datetime
df_alfred.insert(0,'FILE_NAME',os.path.basename(csv_alfred)) #Insert a column with the file name
df_alfred.insert(1,'LOCATION','Alfred, NY') #Insert a column with a broad location name
#import Bath data
df_bath = pd.read_csv(csv_bath, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
df_bath['DATE'] = pd.to_datetime(df_bath['DATE']) #Convert the DATE column from an object to a datetime
df_bath.insert(0,'FILE_NAME',os.path.basename(csv_bath)) #Insert a column with the file name
df_bath.insert(1,'LOCATION','Bath, NY') #Insert a column with a broad location name
#import Dansville data
df_dansville = pd.read_csv(csv_dansville, usecols = ['STATION','NAME','DATE','TMAX','TMIN','TOBS','PRCP','SNOW','SNWD'], low_memory=False) #Read file into dataframe
df_dansville['DATE'] = pd.to_datetime(df_dansville['DATE']) #Convert the DATE column from an object to a datetime
df_dansville.insert(0,'FILE_NAME',os.path.basename(csv_dansville)) #Insert a column with the file name
df_dansville.insert(1,'LOCATION','Dansville, NY') #Insert a column with a broad location name
#Verify the data fields in each file
print('File Name:',csv_alfred)
print('Location:',df_alfred['LOCATION'].iloc[0],'\n')
df_alfred.info()
print('\n')
#Display the dataframe as a table
from IPython.display import display
pd.set_option('display.max_columns', None) #Display all of the columns
print('Location:',df_alfred['LOCATION'].iloc[0]) #Show the file location above the table to verify our file
with pd.option_context("display.max_rows", 15): display(df_alfred) #Display the data
File Name: Data/NOAA Daily Summaries - Alfred, NY.csv
Location: Alfred, NY

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43858 entries, 0 to 43857
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   FILE_NAME  43858 non-null  object
 1   LOCATION   43858 non-null  object
 2   STATION    43858 non-null  object
 3   NAME       43858 non-null  object
 4   DATE       43858 non-null  datetime64[ns]
 5   PRCP       42059 non-null  float64
 6   SNOW       38702 non-null  float64
 7   SNWD       33925 non-null  float64
 8   TMAX       43496 non-null  float64
 9   TMIN       43532 non-null  float64
 10  TOBS       39470 non-null  float64
dtypes: datetime64[ns](1), float64(6), object(4)
memory usage: 3.7+ MB

Location: Alfred, NY
| FILE_NAME | LOCATION | STATION | NAME | DATE | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-01 | 0.00 | 0.0 | NaN | 39.0 | 24.0 | NaN |
| 1 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-02 | 1.65 | 0.0 | NaN | 34.0 | 20.0 | NaN |
| 2 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-03 | 0.30 | 3.0 | NaN | 20.0 | 4.0 | NaN |
| 3 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-04 | 0.20 | 2.0 | NaN | 10.0 | 3.0 | NaN |
| 4 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-05 | 0.30 | 3.0 | NaN | 15.0 | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43853 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-08-20 | 0.08 | 0.0 | 0.0 | 64.0 | 48.0 | 51.0 |
| 43854 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-08-21 | 0.00 | 0.0 | 0.0 | 61.0 | 47.0 | 48.0 |
| 43855 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-08-22 | 0.02 | 0.0 | 0.0 | 64.0 | 47.0 | 51.0 |
| 43856 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-08-23 | 0.00 | 0.0 | 0.0 | 73.0 | 51.0 | 52.0 |
| 43857 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-08-24 | 0.00 | 0.0 | NaN | 78.0 | 52.0 | 52.0 |
43858 rows × 11 columns
There, that's better
Now we can see and understand things a little better; the tables are much more human-readable, and I am, after all, a human.
From NOAA's documentation PDF we can see that both the Precipitation (PRCP) and Snowfall (SNOW) columns are in inches.
Let's tackle our first question: How many days during a winter month had temperatures above freezing?
Winter in the northern hemisphere is defined as starting on the winter solstice (the year's shortest day) and ending on the vernal equinox (when day and night are equal in length): generally December 21st or 22nd to March 20th or 21st. While I know we've had large storms in November, let's set November aside for later and, for simplicity, stick to the entire months of December through March.
We'll start with some more data validation.
pd.set_option('display.max_rows',25) #show all rows for this section
df_alfred_year_rows = df_alfred.groupby(df_alfred.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_alfred_year_rows_rslt = df_alfred_year_rows.loc[df_alfred_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
display(df_alfred_year_rows_rslt)
print('This many years are missing some data:',len(df_alfred_year_rows_rslt.index))
| DATE | Rows | |
|---|---|---|
| 4 | 1897 | 304 |
| 7 | 1900 | 356 |
| 8 | 1901 | 243 |
| 9 | 1910 | 245 |
| 11 | 1912 | 321 |
| 26 | 1927 | 361 |
| 38 | 1939 | 364 |
| 44 | 1945 | 363 |
| 45 | 1946 | 361 |
| 46 | 1947 | 364 |
| 49 | 1950 | 364 |
| 53 | 1954 | 364 |
| 63 | 1964 | 335 |
| 88 | 1989 | 303 |
| 89 | 1990 | 337 |
| 90 | 1991 | 334 |
| 92 | 1993 | 335 |
| 93 | 1994 | 303 |
| 106 | 2007 | 243 |
| 109 | 2010 | 120 |
| 110 | 2011 | 53 |
| 111 | 2012 | 362 |
| 113 | 2014 | 364 |
| 123 | 2024 | 236 |
This many years are missing some data: 24
Hmmmmmmmmm
Well, since we're only interested in data from winter, how about we look at just those months. With December, January, February, and March, we should have a total of 121 days per winter, or 122 when there is a leap day.
df_alfred_winter = df_alfred.loc[df_alfred['DATE'].dt.month.isin([12, 1, 2, 3])].copy() #Create a new frame (an explicit copy, so later column additions don't warn) with only months in winter
df_alfred_winter_rows = df_alfred_winter.groupby(df_alfred_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_alfred_winter_rows_rslt = df_alfred_winter_rows.loc[df_alfred_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
display(df_alfred_winter_rows_rslt)
print('This many years are missing some winter data:',len(df_alfred_winter_rows_rslt.index))
| DATE | Rows | |
|---|---|---|
| 4 | 1897 | 90 |
| 9 | 1910 | 31 |
| 11 | 1912 | 85 |
| 44 | 1945 | 119 |
| 45 | 1946 | 119 |
| 88 | 1989 | 90 |
| 89 | 1990 | 93 |
| 90 | 1991 | 90 |
| 109 | 2010 | 90 |
| 110 | 2012 | 118 |
| 122 | 2024 | 91 |
This many years are missing some winter data: 11
But wait, there's something wrong!
We have to massage the data a little bit. If we are trying to look at winter seasons that span the change of a calendar year, we aren't comparing apples to apples because December 2020 would stay in 2020 while January 2021 would be in 2021, when we actually want them combined.
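The relabelling rule can be sanity-checked on a few hand-picked dates before touching the real frame. This is a standalone sketch, separate from the notebook's dataframes; it includes October and November in the rule since a late-season analysis might want them later:

```python
import pandas as pd

# Sanity check of the winter-year rule: late-year months (Oct-Dec) are pushed
# into the next calendar year's winter; Jan-Mar keep their own year.
dates = pd.to_datetime(['2020-12-15', '2021-01-15', '2021-03-15'])
winter_year = dates.year + dates.month.isin([10, 11, 12]).astype(int)
print(list(winter_year))  # → [2021, 2021, 2021] - all one winter season
```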
import numpy as np
#Insert a new column
df_alfred_winter.insert(5,'WINTER_YEAR',df_alfred_winter['DATE'].dt.year)
#del df_alfred_winter['WINTER_YEAR']
#Replace the year of the date field in the new column with 1 year higher if in October, November, or December
df_alfred_winter['WINTER_YEAR'] = np.where(df_alfred_winter['DATE'].dt.month.isin([10, 11, 12]), df_alfred_winter['WINTER_YEAR']+1, df_alfred_winter['WINTER_YEAR'])
display(df_alfred_winter)
| FILE_NAME | LOCATION | STATION | NAME | DATE | WINTER_YEAR | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-01 | 1893 | 0.00 | 0.0 | NaN | 39.0 | 24.0 | NaN |
| 1 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-02 | 1893 | 1.65 | 0.0 | NaN | 34.0 | 20.0 | NaN |
| 2 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-03 | 1893 | 0.30 | 3.0 | NaN | 20.0 | 4.0 | NaN |
| 3 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-04 | 1893 | 0.20 | 2.0 | NaN | 10.0 | 3.0 | NaN |
| 4 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 1893-01-05 | 1893 | 0.30 | 3.0 | NaN | 15.0 | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43708 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-03-27 | 2024 | 0.00 | 0.0 | 0.0 | 56.0 | 39.0 | 45.0 |
| 43709 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-03-28 | 2024 | 0.00 | 0.0 | 0.0 | 64.0 | 29.0 | 33.0 |
| 43710 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-03-29 | 2024 | 0.00 | 0.0 | 0.0 | 49.0 | 26.0 | 29.0 |
| 43711 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-03-30 | 2024 | 0.00 | 0.0 | 0.0 | 37.0 | 24.0 | 28.0 |
| 43712 | NOAA Daily Summaries - Alfred, NY.csv | Alfred, NY | USC00300085 | ALFRED, NY US | 2024-03-31 | 2024 | 0.18 | 0.0 | 0.0 | 47.0 | 28.0 | 34.0 |
14594 rows × 12 columns
That seems like we have most of the data from winter that we need
Our results may contain some error, but with only 11 years missing some winter dates, I think we will be okay to use this data as-is.
Let's start gathering some data!
import plotly.express as px
import numpy as np
df_alfred_winter_maxoverfreeze = df_alfred_winter[df_alfred_winter['TMAX'] > 32] #Keep only days whose high was above freezing
#display(df_alfred_winter_maxoverfreeze)
df_alfred_winter_maxoverfreeze_agg = df_alfred_winter_maxoverfreeze.groupby(df_alfred_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Freezing') #Count qualifying days per winter year
#display(df_alfred_winter_maxoverfreeze_agg)
#add new rolling/moving average columns
df_alfred_winter_maxoverfreeze_agg['5YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Freezing'].rolling(5).mean().fillna(0).astype(np.int64)
df_alfred_winter_maxoverfreeze_agg['10YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Freezing'].rolling(10).mean().fillna(0).astype(np.int64)
df_alfred_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace the 0s from the incomplete rolling windows with Null (NaN)
#df_alfred_winter_maxoverfreeze_agg.info()
#display(df_alfred_winter_maxoverfreeze_agg)
fig_alfred_winter_maxoverfreeze = px.line(df_alfred_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Freezing', title='Number of Days With High Temp. Above Freezing in Alfred, NY')
fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_maxoverfreeze.show()
Oh, that data between 1900 and 1910 is being cranky
Let's get rid of that and see what it looks like
df_alfred_winter_maxoverfreeze = df_alfred_winter_maxoverfreeze.loc[df_alfred_winter_maxoverfreeze['WINTER_YEAR'] > 1909] #Filter on the frame's own WINTER_YEAR column
df_alfred_winter_maxoverfreeze_agg = df_alfred_winter_maxoverfreeze.groupby(df_alfred_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Freezing')
#display(df_alfred_winter_maxoverfreeze_agg)
#add new rolling/moving average columns
df_alfred_winter_maxoverfreeze_agg['5YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Freezing'].rolling(5).mean().fillna(0).astype(np.int64)
df_alfred_winter_maxoverfreeze_agg['10YRolling'] = df_alfred_winter_maxoverfreeze_agg['Days_Above_Freezing'].rolling(10).mean().fillna(0).astype(np.int64)
df_alfred_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace the 0s from the incomplete rolling windows with Null (NaN)
#df_alfred_winter_maxoverfreeze_agg.info()
#display(df_alfred_winter_maxoverfreeze_agg)
fig_alfred_winter_maxoverfreeze = px.line(df_alfred_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Freezing', title='Number of Days with High Temp. Above Freezing in Alfred, NY')
fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_alfred_winter_maxoverfreeze.add_scatter(x=df_alfred_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_maxoverfreeze.show()
There might be something here
It seems like the low points of the troughs in our rolling averages have increased in the most recent dip since the turn of the millennium, which would suggest that, over multiple years, more days have had highs above freezing than in previous swings.
It also seems like earlier winters averaged between 65 and 75 such days, but starting in the 1990s that average increased to between 70 and 80 days.
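That 65-75 vs. 70-80 impression can be checked numerically. Here is a minimal sketch using made-up per-winter counts in place of the real aggregated frame; the idea is simply to bucket winters into eras and compare the means:

```python
import pandas as pd

# Toy stand-in for the aggregated frame: one row per winter with a count of
# days whose high was above freezing (these values are illustrative, not real).
agg = pd.DataFrame({
    'WINTER_YEAR': [1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020],
    'Days_Above_Frezing': [68, 72, 66, 74, 71, 77, 79, 80],
})

# Bucket winters into before/after 1990 and compare the mean count, which is
# one way to test the "65-75 then 70-80" impression with a number.
agg['ERA'] = agg['WINTER_YEAR'].apply(lambda y: 'pre-1990' if y < 1990 else '1990-on')
era_means = agg.groupby('ERA')['Days_Above_Frezing'].mean()
print(era_means)
```

On the real data the same two lines, applied to `df_alfred_winter_maxoverfreeze_agg`, would give the actual era averages.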
Let's try looking at the Daily Low temperature, TMIN. Maybe this makes sense as the highs could be above freezing during the day but also if the low is above freezing that could certainly make a difference.
df_alfred_winter_minoverfreeze = df_alfred_winter[df_alfred_winter['TMIN'] > 32]
df_alfred_winter_minoverfreeze = df_alfred_winter_minoverfreeze.loc[df_alfred_winter_minoverfreeze['WINTER_YEAR'] > 1909] #Filter using this frame's own column, not a mask from the parent frame
df_alfred_winter_minoverfreeze_agg = df_alfred_winter_minoverfreeze.groupby(df_alfred_winter_minoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
#display(df_alfred_winter_minoverfreeze_agg)
#add a new rolling/moving average column
df_alfred_winter_minoverfreeze_agg['5YRolling'] = df_alfred_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
df_alfred_winter_minoverfreeze_agg['10YRolling'] = df_alfred_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
df_alfred_winter_minoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
#df_alfred_winter_minoverfreeze_agg.info()
#display(df_alfred_winter_minoverfreeze_agg)
fig_alfred_winter_minoverfreeze = px.line(df_alfred_winter_minoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With Low Temp. Above Freezing in Alfred, NY')
fig_alfred_winter_minoverfreeze.add_scatter(x=df_alfred_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_minoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_alfred_winter_minoverfreeze.add_scatter(x=df_alfred_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_alfred_winter_minoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_minoverfreeze.show()
Now here's something!
If my hypothesis is correct that the low temperature has more of an impact on winter weather, in keeping snow on the ground and preventing rain, this seems to confirm what I have been experiencing and suspecting. If the low temperature on a day stays above freezing, snow can melt around the clock and precipitation is more likely to fall as rain than as snow.
While we don't have good data before 1910, there are still visible swings, peaks and valleys, but that pattern ended with a trough in the early 1960s, and we never saw another trough until 2022, when the number of days with a low above freezing was double that of the previous trough!
So let's look at the number of days we had with snow during those winter months
df_alfred_winter_snow = df_alfred_winter[df_alfred_winter['SNOW'] > 0]
df_alfred_winter_snow = df_alfred_winter_snow.loc[df_alfred_winter_snow['WINTER_YEAR'] > 1909]
df_alfred_winter_snow_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_With_Snow')
df_alfred_winter_snow_agg['5YRolling'] = df_alfred_winter_snow_agg['Days_With_Snow'].rolling(5).mean().fillna(0).astype(np.int64)
df_alfred_winter_snow_agg['10YRolling'] = df_alfred_winter_snow_agg['Days_With_Snow'].rolling(10).mean().fillna(0).astype(np.int64)
df_alfred_winter_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
#display(df_alfred_winter_snow_agg)
with pd.option_context("display.max_rows", 15): display(df_alfred_winter_snow_agg)
| | WINTER_YEAR | Days_With_Snow | 5YRolling | 10YRolling |
|---|---|---|---|---|
| 0 | 1911 | 35 | NaN | NaN |
| 1 | 1912 | 10 | NaN | NaN |
| 2 | 1913 | 22 | NaN | NaN |
| 3 | 1914 | 44 | NaN | NaN |
| 4 | 1915 | 41 | 30.0 | NaN |
| ... | ... | ... | ... | ... |
| 108 | 2020 | 38 | 42.0 | 43.0 |
| 109 | 2021 | 41 | 44.0 | 42.0 |
| 110 | 2022 | 35 | 41.0 | 43.0 |
| 111 | 2023 | 34 | 38.0 | 41.0 |
| 112 | 2024 | 25 | 34.0 | 39.0 |
113 rows × 4 columns
fig_alfred_winter_snow = px.line(df_alfred_winter_snow_agg, x='WINTER_YEAR', y='Days_With_Snow', title='Number of Days With Snow in Alfred, NY')
fig_alfred_winter_snow.add_scatter(x=df_alfred_winter_snow_agg['WINTER_YEAR'], y=df_alfred_winter_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_alfred_winter_snow.add_scatter(x=df_alfred_winter_snow_agg['WINTER_YEAR'], y=df_alfred_winter_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_snow.show()
This matches my personal experience, but doesn't by itself point to a greater long-term issue
Since the winter of 2010, my first year in Alfred, I have noticed that the winters have not been as harsh as before. The chart bears that out, but it does not show that these years have been any different from earlier periods.
Let's look at the total accumulated snowfall each year, then.
#Take the same dataframe as before where we found all days with SNOW > 0, and sum instead of count
df_alfred_winter_total_snow_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['SNOW'].sum().reset_index(name='Total_Snowfall')
df_alfred_winter_total_snow_agg['5YRolling'] = df_alfred_winter_total_snow_agg['Total_Snowfall'].rolling(5).mean().fillna(0).astype(np.int64)
df_alfred_winter_total_snow_agg['10YRolling'] = df_alfred_winter_total_snow_agg['Total_Snowfall'].rolling(10).mean().fillna(0).astype(np.int64)
df_alfred_winter_total_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
with pd.option_context("display.max_rows", 15): display(df_alfred_winter_total_snow_agg)
| | WINTER_YEAR | Total_Snowfall | 5YRolling | 10YRolling |
|---|---|---|---|---|
| 0 | 1911 | 68.9 | NaN | NaN |
| 1 | 1912 | 17.8 | NaN | NaN |
| 2 | 1913 | 28.7 | NaN | NaN |
| 3 | 1914 | 74.3 | NaN | NaN |
| 4 | 1915 | 71.3 | 52.0 | NaN |
| ... | ... | ... | ... | ... |
| 108 | 2020 | 69.2 | 57.0 | 55.0 |
| 109 | 2021 | 41.8 | 60.0 | 53.0 |
| 110 | 2022 | 55.9 | 56.0 | 56.0 |
| 111 | 2023 | 42.9 | 49.0 | 54.0 |
| 112 | 2024 | 27.9 | 47.0 | 50.0 |
113 rows × 4 columns
fig_alfred_winter_total_snow = px.line(df_alfred_winter_total_snow_agg, x='WINTER_YEAR', y='Total_Snowfall', title='Total Annual Snowfall in Alfred, NY')
fig_alfred_winter_total_snow.add_scatter(x=df_alfred_winter_total_snow_agg['WINTER_YEAR'], y=df_alfred_winter_total_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_alfred_winter_total_snow.add_scatter(x=df_alfred_winter_total_snow_agg['WINTER_YEAR'], y=df_alfred_winter_total_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_total_snow.show()
Hmmmm, that's interesting too
It's easy to tell that total snowfall has decreased, from a peak rolling average of ~90 inches around 2009 to only ~50 inches in 2024, but that is the same average as the 1920s and 1930s. Interestingly, total snowfall steadily increased over the course of the 20th century, until a drop in the 1980s followed by a steep increase into the early 2000s.
Let's take one last look at Average Snow Pack
Before we move on to looking at Dansville to see if our various microclimates in NYS have experienced climate change any differently from each other, let's see if there's been any change to average snow pack between December and March.
#df_alfred_winter_snow is already limited to winters after 1909, so average the snow depth (SNWD) over those days with recorded snowfall
df_alfred_winter_snowpack_agg = df_alfred_winter_snow.groupby(df_alfred_winter_snow.WINTER_YEAR)['SNWD'].mean().reset_index(name='Average_Snowpack')
#df_alfred_winter_snowpack_agg['5YRolling'] = df_alfred_winter_snowpack_agg['Average_Snowpack'].rolling(5).mean()
#df_alfred_winter_snowpack_agg['10YRolling'] = df_alfred_winter_snowpack_agg['Average_Snowpack'].rolling(10).mean()
with pd.option_context("display.max_rows", 15): display(df_alfred_winter_snowpack_agg)
| | WINTER_YEAR | Average_Snowpack |
|---|---|---|
| 0 | 1911 | NaN |
| 1 | 1912 | NaN |
| 2 | 1913 | NaN |
| 3 | 1914 | 10.204545 |
| 4 | 1915 | 10.975610 |
| ... | ... | ... |
| 108 | 2020 | 3.552632 |
| 109 | 2021 | 4.439024 |
| 110 | 2022 | 4.257143 |
| 111 | 2023 | 2.117647 |
| 112 | 2024 | 1.600000 |
113 rows × 2 columns
We have some missing data, but let's see how it looks
fig_alfred_winter_snowpack = px.line(df_alfred_winter_snowpack_agg, x='WINTER_YEAR', y='Average_Snowpack', title='Average Annual Snowpack in Alfred, NY')
#fig_alfred_winter_snowpack.add_scatter(x=df_alfred_winter_snowpack_agg['WINTER_YEAR'], y=df_alfred_winter_snowpack_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
#fig_alfred_winter_snowpack.add_scatter(x=df_alfred_winter_snowpack_agg['WINTER_YEAR'], y=df_alfred_winter_snowpack_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_alfred_winter_snowpack.show()
Not much to see here
Unfortunately there is a good chunk of data missing between 1993 and 2010, which also undermines the moving averages, so there isn't much to look at here. At best we can see lower averages since 2014 compared to previous years, but it would be a bit unscientific to draw a conclusion.
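One way to handle a gappy series like this, sketched below on toy data rather than the real SNWD frame: `rolling(...).mean()` already leaves NaN until a window fills, so the `fillna(0)`/`replace(0, np.nan)` round-trip used in the cells above isn't strictly needed (and would clobber a genuine 0 in the data), and `min_periods` lets a window still produce an average when some years are missing.

```python
import numpy as np
import pandas as pd

# Toy yearly series with a gap (NaN), like the missing 1993-2010 snow-depth data.
s = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 9.0, 14.0])

# With the default min_periods, any NaN inside the window yields NaN;
# min_periods=2 averages whatever values are present, as long as there are >= 2.
strict = s.rolling(3).mean()
lenient = s.rolling(3, min_periods=2).mean()

print(strict.tolist())
print(lenient.tolist())
```

Applied to the aggregation above, this would keep the rolling-average columns as floats and avoid the replace-zeros step entirely.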
Let's look at Dansville's weather data
pd.set_option('display.max_rows',None) #show all rows for this section
df_dansville_winter = df_dansville.loc[df_dansville['DATE'].dt.month.isin([12, 1, 2, 3])].copy() #Create a new frame (a real copy, to avoid SettingWithCopyWarning) with only months in winter
df_dansville_winter.insert(5,'WINTER_YEAR',df_dansville_winter['DATE'].dt.year)
#Bump the year in the new column by 1 for October, November, and December so each winter is labeled by the year it ends in
df_dansville_winter['WINTER_YEAR'] = np.where(df_dansville_winter['DATE'].dt.month.isin([10, 11, 12]), df_dansville_winter['WINTER_YEAR']+1, df_dansville_winter['WINTER_YEAR'])
with pd.option_context("display.max_rows", 15): display(df_dansville_winter)
| | FILE_NAME | LOCATION | STATION | NAME | DATE | WINTER_YEAR | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 153 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-01 | 1918 | 0.00 | 0.0 | NaN | 32.0 | 8.0 | 26.0 |
| 154 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-02 | 1918 | 0.10 | 1.0 | NaN | 26.0 | 15.0 | 16.0 |
| 155 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-03 | 1918 | 0.03 | 0.3 | NaN | 34.0 | 10.0 | 28.0 |
| 156 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-04 | 1918 | 0.00 | 0.0 | NaN | 50.0 | 28.0 | 44.0 |
| 157 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-05 | 1918 | 0.00 | 0.0 | NaN | 44.0 | 15.0 | 15.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37609 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-27 | 2024 | 0.00 | 0.0 | 0.0 | 59.0 | 43.0 | 52.0 |
| 37610 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-28 | 2024 | 0.00 | 0.0 | 0.0 | 67.0 | 36.0 | 37.0 |
| 37611 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-29 | 2024 | 0.00 | 0.0 | 0.0 | 50.0 | 25.0 | 27.0 |
| 37612 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-30 | 2024 | 0.00 | 0.0 | 0.0 | 40.0 | 27.0 | 27.0 |
| 37613 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-31 | 2024 | 0.01 | 0.0 | 0.0 | 52.0 | 27.0 | 38.0 |
12444 rows × 12 columns
df_dansville_year_rows = df_dansville.groupby(df_dansville.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_dansville_year_rows_rslt = df_dansville_year_rows.loc[df_dansville_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
display(df_dansville_year_rows_rslt)
print('This many years are missing some data:',len(df_dansville_year_rows_rslt.index))
| | DATE | Rows |
|---|---|---|
| 0 | 1917 | 153 |
| 1 | 1918 | 282 |
| 17 | 1934 | 296 |
| 22 | 1939 | 334 |
| 23 | 1941 | 1 |
| 24 | 1942 | 306 |
| 39 | 1957 | 364 |
| 69 | 1987 | 182 |
| 77 | 1995 | 331 |
| 89 | 2007 | 243 |
| 92 | 2010 | 334 |
| 106 | 2024 | 237 |
This many years are missing some data: 12
df_dansville_winter_rows = df_dansville_winter.groupby(df_dansville_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_dansville_winter_rows_rslt = df_dansville_winter_rows.loc[df_dansville_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
display(df_dansville_winter_rows_rslt)
print('This many years are missing some winter data:',len(df_dansville_winter_rows_rslt.index))
| | DATE | Rows |
|---|---|---|
| 0 | 1918 | 59 |
| 16 | 1934 | 52 |
| 21 | 1939 | 90 |
| 22 | 1942 | 62 |
| 75 | 1995 | 120 |
| 87 | 2007 | 90 |
| 104 | 2024 | 91 |
This many years are missing some winter data: 7
We're off to a better start looking at Dansville
Luckily, less data is missing at the Dansville station. I still wanted to look at Dansville before Bath because Bath's data spans multiple stations that changed locations and have overlapping entries, which makes it much more troublesome to deal with.
df_dansville_winter_maxoverfreeze = df_dansville_winter[df_dansville_winter['TMAX'] > 32]
with pd.option_context("display.max_rows", 15): display(df_dansville_winter_maxoverfreeze)
#df_dansville_winter_maxoverfreeze_agg = df_dansville_winter_maxoverfreeze.groupby(df_dansville_winter_maxoverfreeze.DATE.dt.year)['DATE'].count().reset_index(name='Days_Above_Frezing')
df_dansville_winter_maxoverfreeze_agg = df_dansville_winter_maxoverfreeze.groupby(df_dansville_winter_maxoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
#display(df_dansville_winter_maxoverfreeze_agg)
#add a new rolling/moving average column
df_dansville_winter_maxoverfreeze_agg['5YRolling'] = df_dansville_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
df_dansville_winter_maxoverfreeze_agg['10YRolling'] = df_dansville_winter_maxoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
df_dansville_winter_maxoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
#df_dansville_winter_maxoverfreeze_agg.info()
#display(df_dansville_winter_maxoverfreeze_agg)
fig_dansville_winter_maxoverfreeze = px.line(df_dansville_winter_maxoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With High Temp. Above Freezing in Dansville, NY')
fig_dansville_winter_maxoverfreeze.add_scatter(x=df_dansville_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_maxoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_dansville_winter_maxoverfreeze.add_scatter(x=df_dansville_winter_maxoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_maxoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_dansville_winter_maxoverfreeze.show()
| | FILE_NAME | LOCATION | STATION | NAME | DATE | WINTER_YEAR | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 155 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-03 | 1918 | 0.03 | 0.3 | NaN | 34.0 | 10.0 | 28.0 |
| 156 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-04 | 1918 | 0.00 | 0.0 | NaN | 50.0 | 28.0 | 44.0 |
| 157 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-05 | 1918 | 0.00 | 0.0 | NaN | 44.0 | 15.0 | 15.0 |
| 159 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-07 | 1918 | 0.00 | 0.0 | NaN | 38.0 | 16.0 | 35.0 |
| 160 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 1918-02-08 | 1918 | 0.50 | 0.0 | NaN | 58.0 | 34.0 | 48.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37609 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-27 | 2024 | 0.00 | 0.0 | 0.0 | 59.0 | 43.0 | 52.0 |
| 37610 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-28 | 2024 | 0.00 | 0.0 | 0.0 | 67.0 | 36.0 | 37.0 |
| 37611 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-29 | 2024 | 0.00 | 0.0 | 0.0 | 50.0 | 25.0 | 27.0 |
| 37612 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-30 | 2024 | 0.00 | 0.0 | 0.0 | 40.0 | 27.0 | 27.0 |
| 37613 | NOAA Daily Summaries - Alfred, NY.csv | Dansville, NY | USC00301974 | DANSVILLE, NY US | 2024-03-31 | 2024 | 0.01 | 0.0 | 0.0 | 52.0 | 27.0 | 38.0 |
8311 rows × 12 columns
df_dansville_winter_minoverfreeze = df_dansville_winter[df_dansville_winter['TMIN'] > 32]
df_dansville_winter_minoverfreeze = df_dansville_winter_minoverfreeze.loc[df_dansville_winter_minoverfreeze['WINTER_YEAR'] > 1909] #Filter using this frame's own column, not a mask from the parent frame
df_dansville_winter_minoverfreeze_agg = df_dansville_winter_minoverfreeze.groupby(df_dansville_winter_minoverfreeze.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_Above_Frezing')
#display(df_dansville_winter_minoverfreeze_agg)
#add a new rolling/moving average column
df_dansville_winter_minoverfreeze_agg['5YRolling'] = df_dansville_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(5).mean().fillna(0).astype(np.int64)
df_dansville_winter_minoverfreeze_agg['10YRolling'] = df_dansville_winter_minoverfreeze_agg['Days_Above_Frezing'].rolling(10).mean().fillna(0).astype(np.int64)
df_dansville_winter_minoverfreeze_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
#df_dansville_winter_minoverfreeze_agg.info()
#display(df_dansville_winter_minoverfreeze_agg)
fig_dansville_winter_minoverfreeze = px.line(df_dansville_winter_minoverfreeze_agg, x='WINTER_YEAR', y='Days_Above_Frezing', title='Number of Days With Low Temp. Above Freezing in Dansville, NY')
fig_dansville_winter_minoverfreeze.add_scatter(x=df_dansville_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_minoverfreeze_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_dansville_winter_minoverfreeze.add_scatter(x=df_dansville_winter_minoverfreeze_agg['WINTER_YEAR'], y=df_dansville_winter_minoverfreeze_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_dansville_winter_minoverfreeze.show()
This tells a very different story
Dansville does not seem to have seen the same increase in winter days with temperatures above freezing. It's amazing how much of a difference a 30-minute drive can make in New York in terms of weather.
Let's see if there's any difference in snowfall.
df_dansville_winter_snow = df_dansville_winter[df_dansville_winter['SNOW'] > 0]
df_dansville_winter_snow = df_dansville_winter_snow.loc[df_dansville_winter_snow['WINTER_YEAR'] > 1909]
df_dansville_winter_snow_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['WINTER_YEAR'].count().reset_index(name='Days_With_Snow')
df_dansville_winter_snow_agg['5YRolling'] = df_dansville_winter_snow_agg['Days_With_Snow'].rolling(5).mean().fillna(0).astype(np.int64)
df_dansville_winter_snow_agg['10YRolling'] = df_dansville_winter_snow_agg['Days_With_Snow'].rolling(10).mean().fillna(0).astype(np.int64)
df_dansville_winter_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
with pd.option_context("display.max_rows", 15): display(df_dansville_winter_snow_agg)
| | WINTER_YEAR | Days_With_Snow | 5YRolling | 10YRolling |
|---|---|---|---|---|
| 0 | 1918 | 9 | NaN | NaN |
| 1 | 1919 | 8 | NaN | NaN |
| 2 | 1920 | 13 | NaN | NaN |
| 3 | 1921 | 11 | NaN | NaN |
| 4 | 1922 | 9 | 10.0 | NaN |
| ... | ... | ... | ... | ... |
| 96 | 2020 | 13 | 15.0 | 16.0 |
| 97 | 2021 | 12 | 15.0 | 14.0 |
| 98 | 2022 | 14 | 15.0 | 15.0 |
| 99 | 2023 | 10 | 13.0 | 15.0 |
| 100 | 2024 | 12 | 12.0 | 14.0 |
101 rows × 4 columns
fig_dansville_winter_snow = px.line(df_dansville_winter_snow_agg, x='WINTER_YEAR', y='Days_With_Snow', title='Number of Days With Snow in Dansville, NY')
fig_dansville_winter_snow.add_scatter(x=df_dansville_winter_snow_agg['WINTER_YEAR'], y=df_dansville_winter_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_dansville_winter_snow.add_scatter(x=df_dansville_winter_snow_agg['WINTER_YEAR'], y=df_dansville_winter_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_dansville_winter_snow.show()
#Take the same dataframe as before where we found all days with SNOW > 0, and sum instead of count
df_dansville_winter_total_snow_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['SNOW'].sum().reset_index(name='Total_Snowfall')
df_dansville_winter_total_snow_agg['5YRolling'] = df_dansville_winter_total_snow_agg['Total_Snowfall'].rolling(5).mean().fillna(0).astype(np.int64)
df_dansville_winter_total_snow_agg['10YRolling'] = df_dansville_winter_total_snow_agg['Total_Snowfall'].rolling(10).mean().fillna(0).astype(np.int64)
df_dansville_winter_total_snow_agg.replace(0, np.nan, inplace=True) #Replace 0s from the rolling average back to Null (NaN)
with pd.option_context("display.max_rows", 15): display(df_dansville_winter_total_snow_agg)
| | WINTER_YEAR | Total_Snowfall | 5YRolling | 10YRolling |
|---|---|---|---|---|
| 0 | 1918 | 6.5 | NaN | NaN |
| 1 | 1919 | 23.3 | NaN | NaN |
| 2 | 1920 | 23.7 | NaN | NaN |
| 3 | 1921 | 15.8 | NaN | NaN |
| 4 | 1922 | 17.1 | 17.0 | NaN |
| ... | ... | ... | ... | ... |
| 96 | 2020 | 23.8 | 29.0 | 31.0 |
| 97 | 2021 | 29.4 | 32.0 | 29.0 |
| 98 | 2022 | 31.7 | 33.0 | 31.0 |
| 99 | 2023 | 15.1 | 26.0 | 30.0 |
| 100 | 2024 | 13.0 | 22.0 | 27.0 |
101 rows × 4 columns
fig_dansville_winter_total_snow = px.line(df_dansville_winter_total_snow_agg, x='WINTER_YEAR', y='Total_Snowfall', title='Total Annual Snowfall in Dansville, NY')
fig_dansville_winter_total_snow.add_scatter(x=df_dansville_winter_total_snow_agg['WINTER_YEAR'], y=df_dansville_winter_total_snow_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
fig_dansville_winter_total_snow.add_scatter(x=df_dansville_winter_total_snow_agg['WINTER_YEAR'], y=df_dansville_winter_total_snow_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_dansville_winter_total_snow.show()
Is 100 years of data enough?
We generally understand that Earth's climate changes in cycles, and this would be a very long cycle. But we might be able to discern that we are on the downward, ending side of another cycle, which does not bode well for the next decade.
From the 1920s through the 1940s it looks like there was a steady average of 20-25 inches of snow over each winter in Dansville, increasing from the 1950s through to 1980 and then starting a downward trend. What's noticeable is the pronounced drop in snowfall into the 1990s, below the previous long-range average of the early 20th century. While snowfall trended upward through the 2000s until 2020, it looks like Dansville is trending downward again, and the peak rolling average was only about 60% of the previous cycle's peak.
This concerns me. If we are truly on a downward trend we may keep seeing new lows in annual snowfall, lower high averages, and less snow in general. If the cycles really do last this long I can't make a definitive statement, but I am still concerned by what this suggests right now.
One more time, with feeling - what about snowpack?
#df_dansville_winter_snow is already limited to winters after 1909, so average the snow depth (SNWD) over those days with recorded snowfall
df_dansville_winter_snowpack_agg = df_dansville_winter_snow.groupby(df_dansville_winter_snow.WINTER_YEAR)['SNWD'].mean().reset_index(name='Average_Snowpack')
#df_dansville_winter_snowpack_agg['5YRolling'] = df_dansville_winter_snowpack_agg['Average_Snowpack'].rolling(5).mean()
#df_dansville_winter_snowpack_agg['10YRolling'] = df_dansville_winter_snowpack_agg['Average_Snowpack'].rolling(10).mean()
with pd.option_context("display.max_rows", 15): display(df_dansville_winter_snowpack_agg)
| | WINTER_YEAR | Average_Snowpack |
|---|---|---|
| 0 | 1918 | NaN |
| 1 | 1919 | 2.666667 |
| 2 | 1920 | 4.750000 |
| 3 | 1921 | 2.666667 |
| 4 | 1922 | 2.166667 |
| ... | ... | ... |
| 96 | 2020 | 2.923077 |
| 97 | 2021 | 3.666667 |
| 98 | 2022 | 4.714286 |
| 99 | 2023 | 1.500000 |
| 100 | 2024 | 1.250000 |
101 rows × 2 columns
fig_dansville_winter_snowpack = px.line(df_dansville_winter_snowpack_agg, x='WINTER_YEAR', y='Average_Snowpack', title='Average Annual Snowpack in Dansville, NY')
#fig_dansville_winter_snowpack.add_scatter(x=df_dansville_winter_snowpack_agg['WINTER_YEAR'], y=df_dansville_winter_snowpack_agg['5YRolling'], mode='lines', name='5-Year Rolling Average')
#fig_dansville_winter_snowpack.add_scatter(x=df_dansville_winter_snowpack_agg['WINTER_YEAR'], y=df_dansville_winter_snowpack_agg['10YRolling'], mode='lines', name='10-Year Rolling Average')
fig_dansville_winter_snowpack.show()
It still seems that our snow depth data is a little lacking.
Let's start looking at Bath, then.
As a reminder, here is the data we have available. As mentioned before, per NOAA's notes, coverage is an approximation of total completeness based on the most complete data element and the overall data range. Consistency is just as important as accuracy, though, so let's see what data we can cull so that we end up with a single row of data for each day in similar locations over time.
Bath, NY
The Bath, NY data comprises 6 stations in different locations, going back to 1953.
The overall "Bath, NY" station covers 1953/08/01 through 2014/09/30, so we'll start there. The Avoca station's dates are all already reported by the "Bath, NY" station, so we'll drop all of the Avoca data, and the same goes for Bath 0.4. Bath 4.2 E then covers up to 2024/08/25, but has some earlier data that overlaps the original "Bath, NY" station, so let's cull the Bath 4.2 E rows dated on or before 2014/09/30. The remaining stations all cover shorter timeframes that the stations we are keeping already report, so let's drop them outright.
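A quick way to confirm overlaps like these is to summarize each station's reporting window. This is a sketch on a tiny made-up frame standing in for df_bath; the real frame has the same NAME and DATE columns, just many more rows and stations:

```python
import pandas as pd

# Toy stand-in for df_bath: a couple of daily observations per station.
df = pd.DataFrame({
    'NAME': ['BATH, NY US', 'BATH, NY US', 'BATH 4.2 E, NY US', 'BATH 4.2 E, NY US'],
    'DATE': pd.to_datetime(['1953-08-01', '2014-09-30', '2013-01-15', '2024-08-25']),
})

# Each station's first date, last date, and row count; comparing the min/max
# windows per NAME is how the overlaps described above can be verified before
# deciding which rows to cull.
ranges = df.groupby('NAME')['DATE'].agg(['min', 'max', 'count'])
print(ranges)
```

Run against the real df_bath, this makes it easy to see that Bath 4.2 E starts before the "Bath, NY" station ends.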
#Filtering the Bath, NY Dataframe
print('The raw Bath, NY dataset contains:')
df_bath.info()
print('\n')
print('We can validate our removed rows by checking the following:')
df_bath_stations = df_bath.groupby(df_bath.NAME)['NAME'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
display(df_bath_stations)
print('\n')
print('Here is how many rows we want to delete from Bath 4.2 that are dated on or before 2014/09/30:')
df_bath_42 = df_bath[df_bath['NAME'].isin(['BATH 4.2 E, NY US'])].copy() #copy() avoids SettingWithCopyWarning when adding the flag column
df_bath_42['DELETE_FLAG'] = np.where(df_bath_42['DATE'] <= '2014-09-30', 1, 0)
df_bath_42_count = df_bath_42.groupby(df_bath_42.DELETE_FLAG)['DELETE_FLAG'].count().reset_index(name='Rows')
display(df_bath_42_count)
df_bath_filtered = df_bath[~(
(df_bath['NAME'].isin(['AVOCA 5.0 SSW, NY US', 'BATH 0.4 N, NY US', 'BATH 6.8 SW, NY US', 'BATH 1.3 E, NY US']))
|
((df_bath['NAME'].isin(['BATH 4.2 E, NY US'])) & (df_bath['DATE'] <= '2014-09-30'))
)
]
print('\n')
print('The new dataset contains:')
df_bath_filtered.info()
The raw Bath, NY dataset contains: <class 'pandas.core.frame.DataFrame'> RangeIndex: 24741 entries, 0 to 24740 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FILE_NAME 24741 non-null object 1 LOCATION 24741 non-null object 2 STATION 24741 non-null object 3 NAME 24741 non-null object 4 DATE 24741 non-null datetime64[ns] 5 PRCP 24412 non-null float64 6 SNOW 23025 non-null float64 7 SNWD 20704 non-null float64 8 TMAX 17315 non-null float64 9 TMIN 17363 non-null float64 10 TOBS 17383 non-null float64 dtypes: datetime64[ns](1), float64(6), object(4) memory usage: 2.1+ MB We can validate our removed rows by checking the following:
| | NAME | Rows |
|---|---|---|
| 0 | AVOCA 5.0 SSW, NY US | 193 |
| 1 | BATH 0.4 N, NY US | 300 |
| 2 | BATH 1.3 E, NY US | 628 |
| 3 | BATH 4.2 E, NY US | 2488 |
| 4 | BATH 6.8 SW, NY US | 811 |
| 5 | BATH, NY US | 20321 |
Here is how many rows we want to delete from Bath 4.2 that are earlier than 2014/09/30:
| | DELETE_FLAG | Rows |
|---|---|---|
| 0 | 0 | 1898 |
| 1 | 1 | 590 |
The new dataset contains: <class 'pandas.core.frame.DataFrame'> Index: 22219 entries, 1111 to 24112 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FILE_NAME 22219 non-null object 1 LOCATION 22219 non-null object 2 STATION 22219 non-null object 3 NAME 22219 non-null object 4 DATE 22219 non-null datetime64[ns] 5 PRCP 22111 non-null float64 6 SNOW 21813 non-null float64 7 SNWD 20575 non-null float64 8 TMAX 17315 non-null float64 9 TMIN 17363 non-null float64 10 TOBS 17383 non-null float64 dtypes: datetime64[ns](1), float64(6), object(4) memory usage: 2.0+ MB
The data for Bath is cleaned, let's get started!
We counted the rows from each station, so we know we want to get rid of 1,932 rows from Avoca, Bath 0.4, Bath 6.8, and Bath 1.3. We also need to remove the 590 rows from Bath 4.2 E dated on or before 2014/09/30. That should leave 22,219 rows.
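The row accounting can be sanity-checked with a few lines; the counts below are taken from the station table and the DELETE_FLAG output above:

```python
# Row accounting for the Bath filtering step.
total_rows = 24741                        # rows in the raw Bath dataset
dropped_stations = 193 + 300 + 811 + 628  # Avoca, Bath 0.4, Bath 6.8, Bath 1.3
dropped_bath_42 = 590                     # Bath 4.2 E rows on/before 2014-09-30

remaining = total_rows - dropped_stations - dropped_bath_42
print(remaining)
assert remaining == 22219  # matches df_bath_filtered.info() above
```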
Now let's reduce the Bath data to just the winter months again.
df_bath_winter = df_bath_filtered.loc[df_bath_filtered['DATE'].dt.month.isin([12, 1, 2, 3])].copy() # Keep only the winter months; .copy() avoids SettingWithCopyWarning on the assignments below
df_bath_winter.insert(5, 'WINTER_YEAR', df_bath_winter['DATE'].dt.year)
# Bump October, November, and December rows into the following year's winter
df_bath_winter['WINTER_YEAR'] = np.where(df_bath_winter['DATE'].dt.month.isin([10, 11, 12]), df_bath_winter['WINTER_YEAR'] + 1, df_bath_winter['WINTER_YEAR'])
with pd.option_context("display.max_rows", 15):
    display(df_bath_winter)
|   | FILE_NAME | LOCATION | STATION | NAME | DATE | WINTER_YEAR | PRCP | SNOW | SNWD | TMAX | TMIN | TOBS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1233 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | USC00300448 | BATH, NY US | 1953-12-01 | 1954 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 1234 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | USC00300448 | BATH, NY US | 1953-12-02 | 1954 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 1235 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | USC00300448 | BATH, NY US | 1953-12-03 | 1954 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 1236 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | USC00300448 | BATH, NY US | 1953-12-04 | 1954 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 1237 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | USC00300448 | BATH, NY US | 1953-12-05 | 1954 | 0.23 | 0.0 | 0.0 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23998 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | US1NYST0033 | BATH 4.2 E, NY US | 2024-03-26 | 2024 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 23999 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | US1NYST0033 | BATH 4.2 E, NY US | 2024-03-27 | 2024 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 24000 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | US1NYST0033 | BATH 4.2 E, NY US | 2024-03-28 | 2024 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 24001 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | US1NYST0033 | BATH 4.2 E, NY US | 2024-03-30 | 2024 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
| 24002 | NOAA Daily Summaries - Alfred, NY.csv | Bath, NY | US1NYST0033 | BATH 4.2 E, NY US | 2024-03-31 | 2024 | 0.00 | 0.0 | 0.0 | NaN | NaN | NaN |
7097 rows × 12 columns
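As a quick sanity check on the WINTER_YEAR convention (December counts toward the following year's winter, while January through March keep their own year), here is a self-contained example on a few hypothetical dates:

```python
import numpy as np
import pandas as pd

# Hypothetical dates spanning the December boundary.
dates = pd.to_datetime(['1953-12-01', '1954-01-15', '1954-03-31', '2024-03-31'])
winter_year = np.where(dates.month.isin([10, 11, 12]),
                       dates.year.to_numpy() + 1,
                       dates.year.to_numpy())
print(winter_year.tolist())  # [1954, 1954, 1954, 2024]
```

December 1, 1953 maps to the 1954 winter, matching the first row of the table above.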
df_bath_year_rows = df_bath_filtered.groupby(df_bath_filtered.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_bath_year_rows_rslt = df_bath_year_rows.loc[df_bath_year_rows['Rows'] < 365] #Create a new frame with only the years that have less than 365 rows
display(df_bath_year_rows_rslt)
print('This many years are missing some data:',len(df_bath_year_rows_rslt.index))
|   | DATE (year) | Rows |
|---|---|---|
| 0 | 1953 | 153 |
| 22 | 1975 | 90 |
| 23 | 1976 | 263 |
| 31 | 1984 | 335 |
| 42 | 1995 | 363 |
| 43 | 1996 | 336 |
| 44 | 1997 | 303 |
| 45 | 1998 | 334 |
| 46 | 1999 | 334 |
| 47 | 2000 | 304 |
| 48 | 2001 | 337 |
| 49 | 2002 | 334 |
| 50 | 2003 | 364 |
| 52 | 2005 | 120 |
| 53 | 2008 | 31 |
| 54 | 2009 | 364 |
| 55 | 2010 | 364 |
| 56 | 2011 | 360 |
| 57 | 2012 | 358 |
| 58 | 2013 | 360 |
| 59 | 2014 | 336 |
| 60 | 2015 | 183 |
| 61 | 2016 | 172 |
| 62 | 2017 | 177 |
| 63 | 2018 | 165 |
| 64 | 2019 | 172 |
| 65 | 2020 | 173 |
| 66 | 2021 | 75 |
| 67 | 2022 | 199 |
| 68 | 2023 | 333 |
| 69 | 2024 | 185 |
This many years are missing some data: 31
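One caveat with the `< 365` cutoff: a leap year with exactly 365 of its 366 days reported would slip through as "complete." A stricter check compares each year's count to its true calendar length. This is a sketch on a hypothetical miniature of the per-year counts, not the notebook's code:

```python
import calendar
import pandas as pd

# Hypothetical per-year counts: 2012 is a leap year missing one day,
# 2013 is a common year missing five.
df_year_rows = pd.DataFrame({'DATE': [2011, 2012, 2013],
                             'Rows': [365, 365, 360]})

# Expected calendar length of each year (366 in leap years).
expected = df_year_rows['DATE'].map(lambda y: 366 if calendar.isleap(y) else 365)
incomplete = df_year_rows.loc[df_year_rows['Rows'] < expected]
print(incomplete['DATE'].tolist())  # [2012, 2013]
```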
df_bath_winter_rows = df_bath_winter.groupby(df_bath_winter.DATE.dt.year)['DATE'].count().reset_index(name='Rows') #Create a new frame, counting rows grouped by year
df_bath_winter_rows_rslt = df_bath_winter_rows.loc[df_bath_winter_rows['Rows'] < 121] #Create a new frame with only the years that have less than 121 rows
display(df_bath_winter_rows_rslt)
print('This many years are missing some winter data:',len(df_bath_winter_rows_rslt.index))
|   | DATE (year) | Rows |
|---|---|---|
| 0 | 1953 | 31 |
| 22 | 1975 | 90 |
| 23 | 1976 | 31 |
| 42 | 1995 | 119 |
| 44 | 1997 | 89 |
| 46 | 1999 | 120 |
| 47 | 2000 | 91 |
| 48 | 2001 | 93 |
| 49 | 2002 | 90 |
| 50 | 2003 | 120 |
| 52 | 2005 | 59 |
| 53 | 2008 | 31 |
| 54 | 2009 | 120 |
| 55 | 2010 | 120 |
| 57 | 2012 | 114 |
| 58 | 2013 | 116 |
| 59 | 2014 | 112 |
| 60 | 2015 | 25 |
| 61 | 2016 | 21 |
| 62 | 2017 | 26 |
| 63 | 2018 | 21 |
| 64 | 2019 | 25 |
| 65 | 2020 | 21 |
| 66 | 2022 | 21 |
| 67 | 2023 | 104 |
| 68 | 2024 | 75 |
This many years are missing some winter data: 26
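The 121-day cutoff has the same leap-year wrinkle: a calendar year containing a leap February holds 122 winter days (Jan 31 + Feb 29 + Mar 31 + Dec 31), so 121 rows in such a year is actually one day short. A sketch with hypothetical counts:

```python
import calendar
import pandas as pd

def expected_winter_days(year: int) -> int:
    # Winter days falling inside a single calendar year:
    # Jan (31) + Feb (28 or 29) + Mar (31) + Dec (31).
    return 31 + calendar.monthrange(year, 2)[1] + 31 + 31

# Hypothetical counts: 121 rows is complete for 1999 but one short for 2000.
df_winter_rows = pd.DataFrame({'DATE': [1999, 2000], 'Rows': [121, 121]})
expected = df_winter_rows['DATE'].map(expected_winter_days)
print(df_winter_rows.loc[df_winter_rows['Rows'] < expected, 'DATE'].tolist())  # [2000]
```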
This is a little disappointing
Unfortunately, Bath does not have robust coverage in the most recent years I was hoping to study, so we will have to set it aside.
Conclusions
We told the story along the way, but to summarize: